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AN ASYMPTOTIC SAMPLING FORMULA FOR THE 
COALESCENT WITH RECOMBINATION 

By Paul A. Jenkins 1 and Yun S. Song 1,2 

University of California, Berkeley 

Ewens sampling formula (ESF) is a one-parameter family of prob- 
ability distributions with a number of intriguing combinatorial con- 
nections. This elegant closed-form formula first arose in biology as 
the stationary probability distribution of a sample configuration at 
one locus under the infinite-alleles model of mutation. Since its dis- 
covery in the early 1970s, the ESF has been used in various biological 
applications, and has sparked several interesting mathematical gener- 
alizations. In the population genetics community, extending the un- 
derlying random-mating model to include recombination has received 
much attention in the past, but no general closed- form sampling for- 
mula is currently known even for the simplest extension, that is, a 
model with two loci. In this paper, we show that it is possible to 
obtain useful closed-form results in the case the population-scaled 
recombination rate p is large but not necessarily infinite. Specifically, 
we consider an asymptotic expansion of the two-locus sampling for- 
mula in inverse powers of p and obtain closed-form expressions for the 
first few terms in the expansion. Our asymptotic sampling formula 
applies to arbitrary sample sizes and configurations. 

1. Introduction. The probability of a sample configuration provides a 
useful ground for analyzing genetic data. Popular applications include ob- 
taining maximum likelihood estimates of model parameters and perform- 
ing ancestral inference [see Stephens (2001)]. In principle, model-based full- 
likelihood analyses, such as that based on the coalescent [Kingman (1982a, 
1982b)], should be among the most powerful methods since they make full 
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use of the data. However, in most cases, it is intractable to obtain a closed- 
form formula for the probability of a given data set. A well-known exception 
to this hurdle is the Ewens sampling formula (ESF), which describes the 
stationary probability distribution of a sample configuration under the one- 
locus infinite-alleles model in the diffusion limit [Ewens (1972)]. Notable 
biological applications of this closed-form formula include the test of selec- 
tive neutrality [see Watterson (1977), Slatkin (1994, 1996)]. Hoppe (1984) 
provided a Polya-like urn model interpretation of the formula, and recently 
Griffiths and Lessard (2005) provided a new combinatorial proof of the ESF 
and extended the framework to obtain new results for the case with a vari- 
able population size. We refer the reader to the latter paper for a nice sum- 
mary of previous works related to the ESF. Note that the ESF also arises 
in several interesting contexts outside biology, including random partition 
structures and Bayesian statistics; see Arratia, Barbour and Tavare (2003) 
for examples of intricate combinatorial connections. The ESF is a special 
case of the two-parameter sampling formula constructed by Pitman (1992, 
1995) for exchangeable random partitions. 

Golding (1984) considered generalizing the infinite-alleles model to in- 
clude recombination and constructed a recursion relation satisfied by the 
two-locus sampling probability distribution at stationarity in the diffusion 
limit. Ethier and Griffiths (1990) later undertook a more mathematical anal- 
ysis of the two-locus model and provided several interesting results. However, 
to date, a general closed-form formula for the two-locus sampling distribu- 
tion remains unknown. Indeed, it is widely recognized that recombination 
adds a formidably challenging layer of complexity to population genetics 
analysis. Because obtaining exact analytic results in the presence of recom- 
bination is difficult, recent research has focused on developing sophisticated 
and computationally-intensive Monte Carlo techniques. Examples of such 
techniques applied to the coalescent include Monte Carlo simulations [see 
Hudson (1985, 2001)], importance sampling [see Griffiths and Marjoram 
(1996), Stephens and Donnelly (2000), Fearnhead and Donnelly (2001), De 
Iorio and Griffiths (2004a, 2004b), Griffiths, Jenkins and Song (2008)] and 
Markov chain Monte Carlo methods [see Kuhner, Yamato and Felsenstein 
(2000), Nielsen (2000), Wang and Rannala (2008)]. 

Being the simplest model with recombination, the two-locus case has been 
extensively studied in the past [Griffiths (1981), Golding (1984), Hudson 
(1985), Ethier and Griffiths (1990), Griffiths (1991)] and a renewed wave of 
interest was recently sparked by Hudson (2001), who proposed a composite 
likelihood method which uses two-locus sampling probabilities as building 
blocks. LDhat, a widely- used software package for estimating recombination 
rates, is based on this composite likelihood approach, and it has been used 
to produce a fine-scale map of recombination rate variation in the human 
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genome [McVean et al. (2004), Myers et al. (2005)]. LDhat assumes a sym- 
metric diallelic recurrent mutation model at each locus and relies on the 
importance sampling scheme proposed by Fearnhead and Donnelly (2001) 
for the coalescent with recombination, to generate exhaustive lookup tables 
containing two-locus probabilities for all inequivalent sample configurations 
and a range of relevant parameter values. This process of generating exhaus- 
tive lookup tables is very computationally expensive. A fast and accurate 
method of estimating two-locus probabilities would be of practical value. 

In this paper, we revisit the tantalizing open question of whether a closed- 
form sampling formula can be found for the coalescent with recombination. 
We show that, at least for the two-locus infinite-alleles model with the 
population-scaled recombination rate p large but not necessarily infinite, 
it is possible to obtain useful closed-form analytic results. Note that the 
aforementioned Monte Carlo methods generally become less efficient as p 
increases. Those methods involve sampling a large collection of genealogical 
histories consistent with the observed sample configuration, and, when p is 
large, the sampled genealogies tend to be very complicated; they typically 
contain many recombination events, and it may take a long time for every 
locus to reach a most recent common ancestor. However, contrary to this 
increased complexity in the standard coalescent, we actually expect the evo- 
lutionary dynamics to be easier to describe for large p, since the loci under 
consideration would then be less dependent. Hence, it seems reasonable to 
conjecture that there may exist a stochastic process simpler than the stan- 
dard coalescent with recombination that describes the relevant degrees of 
freedom in the large p limit. We believe that our sampling formula may 
provide some hints as to what that dual process should be. 

The work discussed here generalizes previous results [Golding (1984), 
Ethier and Griffiths (1990)] for p = oo, in which case the loci become in- 
dependent and the two-locus sampling distribution is given by a product of 
one-locus ESFs. Our main results can be summarized as follows. 

Main results. Consider the diffusion limit of the two-locus infinite-alleles 
model with population-scaled mutation rates 9 a and 9b at the two loci. For 
a sample configuration n (defined later in the text), we use q(n | 9a, 9b, p) to 
denote the probability of observing n given the parameters 9a, 9b and p. For 
an arbitrary n, our goal is to find an asymptotic expansion of q(n \ 9a,9b,p) 
in inverse powers of p, that is, for large values of the recombination rate p, 
our goal is to find 

r \ a a ^ t i a a \ , Qi(^\Ga,9 b ) . q2(n\9 A ,9 B ) , rt f^\ 
q{n I 9 A ,9 B ,p) = <7o(n | 9 a, 9b) H 1 5 h 0[ , 

p p l Vp / 

where qo, qi, and qi are independent of p. As mentioned before, qo(n \ 9a, 9b) 
is given by a product of one-locus ESFs. In this paper, we derive a closed- 
form formula for the first-order term qi(n | 9 a, 9b)- Further, we show that 
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the second-order term q2(n \ 9a, Ob) can be decomposed into two parts, one 
for which we obtain a closed-form formula and the other that satisfies a 
simple strict recursion. The latter can be easily evaluated using dynamic 
programming. Details of these results are described in Section 3. In a sim- 
ilar vein, in Section 4, we obtain a simple asymptotic formula for the joint 
probability distribution of the number of alleles observed at the two loci. 

We remark that our work has practical value in genetic analysis. While 
this paper was under review, we applied the technique developed here to 
obtain analogous results for an arbitrary finite-alleles recurrent mutation 
model. See Jenkins and Song (2009) for details. In that paper, we performed 
an extensive assessment of the accuracy of our results for a particular finite- 
alleles model of mutation, and showed that they may be accurate even for 
moderate values of p, including a range that is of biological interest. The 
accuracy (not discussed here) of our results for the infinite-alleles model is 
very similar to that finite-alleles case. 

2. Preliminaries. In this section, we review the ESF for the one-locus 
infinite-alleles model, as well as Golding's (1984) recursion relation for the 
two-locus generalization. Our notational convention generally follows that 
of Ethier and Griffiths (1990). 

Given a positive integer k, [k] denotes the k-set {1, . . . , k}. For a nonneg- 
ative real number x and a positive integer n, {x) n := x{x + 1) • • • (x + n — 1) 
denotes the nth ascending factorial of x. We use to denote either a vector 
or a matrix of all zeroes; it will be clear from context which is intended. 
Throughout, we consider the diffusion limit of a neutral haploid exchange- 
able model of random mating with constant population size 2N. We refer 
to the haploid individuals in the population as gametes. 

2.1. Ewens sampling formula for the one-locus model. In the one-locus 
model, a sample configuration is denoted by a vector of multiplicities n = 
(ni, . . . ,tik), where ni denotes the number of gametes with allele i at the 
locus and K denotes the total number of distinct allelic types observed. We 
use n to denote X^i n «> the total sample size. Under the infinite-alleles 
model, any two gametes can be compared to determine whether or not they 
have the same allele, but it is not possible to determine how the alleles 
are related when they are different. Therefore, allelic label is arbitrary. The 
probability of a mutation event at the locus per gamete per generation is 
denoted by u. In the diffusion limit, N — > oo and u — > with the population- 
scaled mutation rate 6 = ANu held fixed. Each mutation gives rise to a new 
allele that has never been seen before in the population. For the one-locus 
model just described, Ewens (1972) obtained the following result. 
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Proposition 2.1 (Ewens). At stationarity in the diffusion limit of the 
one-locus infinite- alleles model with the scaled mutation parameter 9, the 
probability of an unordered sample configuration n = (ni, . . . ,tik) is given 
by 

(2.1) P(n|0) 



1 



fif-nic "i! ■■•««! {9)n 



where on denotes the number of allele types represented i times, that is, 
on := \{k | n k = i}\. 



Let g/ n denote an ordered configuration of n sequentially sampled ga- 
metes such that the corresponding unordered configuration is given by n. 
By exchangeability, the probability of s# n is invariant under all permutations 
of the sampling order. Hence, we can write this probability of an ordered 
sample as q(n) without ambiguity. It is given by 



(2.2) q(n 



1 



A 



IK-1)! 



i=l 



3 K 



(*)»' 



which follows from the fact that there are 



orderings corre- 



sponding to n [Hoppe (1984)]. To understand what we mean by ordered and 
unordered sampling configurations, it is helpful to relate the Ewens sam- 
pling formula to the theory of random partitions. If the gametes are labeled 
in order of appearance by 1, . . . ,n, then the resulting sample configuration 
defines a random partition of [n] , with gametes belonging to the same block 
if and only if they have the same allele. The quantity q(n \ 9) is then the 
probability of a particular partition of [n] whose block sizes are given by the 
entries in n, while the quantity p(n | 9) is the probability of observing any 
partition of [n] with these block sizes. For example, if n = (2, 1, 1), then there 
are six partitions of [4] with these block sizes, and so p(n | 9) = 6(7(11 | 9). 
It is often more convenient to work with an ordered sample than with an 
unordered sample. In this paper, we will work with the former; that is, we 
will work with q(n \ 9) rather than p(n \ 9). 

In the coalescent process going backward in time, at each event a lineage 
is lost either by coalescence or mutation. By consideration of the most recent 
event back in time, one can show that q(n | 9) satisfies 



A' 



n(n - 1 + 9)q(n \ 9) =^ Uiirii - l)q(n - e,/ | 9) 



(2.3) 



i=l 



A 



i=l 
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where S nit i is the Kronecker delta and e, is a unit vector with the ith entry 
equal to one and all other entries equal to zero. The boundary condition is 
q(ei | 9) = 1 for all i G [K] , and g(n | 9) is defined to be zero if n contains 
any negative component. It can be easily verified that the formula of q(n \ 9) 
shown in (2.2) satisfies the recursion (2.3). 

Ewens (1972) also obtained the following result regarding the number of 
allelic types. 

Proposition 2.2 (Ewens). Let K n denote the number of distinct allelic 
types observed in a sample of size n. Then 



where s(n,k) are the unsigned Stirling numbers of the first kind. Note that 
{9) n = s(n, 1)9 + s(n, 2)6 2 + ■■■ + s{n, n)9 n . 

It follows from (2.1) and (2.4) that K n is a sufficient statistic for 9. 

2.2. Golding's recursion for the two-locus case. Golding (1984) first gen- 
eralized the one-locus recursion (2.3) to two loci, and Ethier and Griffiths 
(1990) later undertook a more mathematical study of the model. We de- 
note the two loci by A and B, and use 9a and 6b to denote the respective 
population-scaled mutation rates. We use K and L to denote the number 
of distinct allelic types observed at locus A and locus B, respectively. The 
population-scaled recombination rate is denoted by p = 4Nr, where r is the 
probability of a recombination event between the two loci per gamete per 
generation. A key observation is that to obtain a closed system of equations, 
the type space must be extended to allow some gametes to be specified only 
at one of the two loci. 

Definition 2.1 (Extended sample configuration for two loci). The two- 
locus sample configuration is denoted by n = (a, b, c), where a = (oi, . . . , ax) 
with ai being the number of gametes with allele i at locus A and unspecified 
alleles at locus B, b = (pi, . . . , bi) with bj being the number of gametes with 
unspecified alleles at locus A and allele j at locus B, and c = (c^ ) is a K x L 
matrix with c« being the multiplicity of gametes with allele i at locus A and 
allele j at locus B. Further, we define 



(2.4) 



F(K n = k\6) 



s(n,k)9 k 



K 



L 



A 



L 






i=i j=i 



L 



K 



b 





n = a + b + c. 



ASYMPTOTIC SAMPLING FORMULA FOR A TWO-LOCUS MODEL 



7 



We use g(a, b,c) to denote the sampling probability of an ordered sam- 
ple with configuration (a, b,c). For ease of notation, we do not show the 
dependence on parameters. For < p < oo, Golding's (1984) recursion for 
q(a, b,c) takes the following form: 

[n(n - 1) + A (a + c) + B (b + c) + pc]q(a, b, c) 

K 

= s ^a i (a i - 1 + 2cj.)g(a- ej,b,c) 

L 

+ ^2 b j( b j -l + 2c. i )g(a,b-e 3 -,c) 

3=1 
K L 

+ ^2^2[cij{cij - l)g(a,b,c-e ii ) 
i=i j=i 

(2.5) + 2aibjq(a - e^, b - e, , c + e^)] 

K r L 

+ ° A Yl Sa * +«■> 1 Sc *J . 1 q b + e 3 ' C ~ ei 3 ) 

i=l lj=l 

+ <W<^.,og(a-ei,b,c) 

' K 

^ <^+c.,-,i£c»,ig(a + e,, b, c - ejj) 
,i=i 

+ \,i5 c . J , g(a,b-e j ,c) 

X L 

+ ^ Cjjg(a + e,, b + e 3 -, c - e^). 
i=i i=i 

Relevant boundary conditions are g(ej,0,0) = g(0,ej,0) = 1 for all i G [if] 
and j € [L]. For notational convenience, we deviate from Ethier and Grif- 
fiths (1990) and allow each summation to range over all allelic types. To 
be consistent, we define q(a, b,c) = whenever any entry in a, b or c is 
negative. 

For ease of discussion, we define the following terms. 

Definition 2.2 (Degree). The degree of q(a, b,c) is defined to be a + 
6 + 2c. 



7 = 1 
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Definition 2.3 (Strictly recursive). We say that a recursion relation is 
strictly recursive if it contains only a single term of the highest degree. 



Except in the special case p = oo, a closed-form solution for q(a, b, c) is not 
known. Notice that the terms c/(a — ej, b — e.,-, c + ejj) and (?(a+ej, b + e,,, c — 
ejj) on the right-hand side of (2.5) have the same degree as q(a, b,c) on the 
left-hand side. Therefore, (2.5) is not strictly recursive. For each degree, 
we therefore need to solve a system of coupled equations, and this system 
grows very rapidly with n. For example, for a sample with a = 0, b = and 
c = 40, computing q(0, 0, c) requires solving a system of more than 20,000 
coupled equations [Hudson (2001)]; this is around the limit of sample sizes 
that can be handled in a reasonable time. In the following section, we revisit 
the problem of obtaining a closed-form formula for q(a, b,c) and obtain an 
asymptotic expansion for large p. 



3. An asymptotic sampling formula for the two-locus case. For large p, 
our objective is to find an asymptotic expansion of the form 

/q -i \ r -u \ t u \ , <?i( a ' b ' c ) . <?2(a,b,c) /1\ 

(3.1 q a,b,c) = q a,b,c + H s h0 ~s ' 

p p z VP / 

where qo,q± and 92 are independent of p. Our closed- form formulas will be 
expressed using the following notation. 



Definition 3.1. For a given multiplicity vector a= (01, . . . ,0^) with 
a = X2t=i a «' we define 



(3.2) 



q A (a) 



A 



i=l 



9 A- 



Similarly, for a given multiplicity vector b = (61, ... , bi) with 6 = X^=i 
we define 



(3.3) 



g B (b) 



j=i 



1 B 



{0 B )b 



As discussed in Section 2.1, q A (respectively, q B ) gives the probability of an 
ordered sample taken from locus A (respectively, B). 



Definition 3.2 (Marginal configuration). We use ca = (ci-)ie\K] an d 
cb = ( c -j)je[L] to denote the marginal sample configurations of c restricted 
to locus A and locus B, respectively. 
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The leading-order term qo(a,h,c) is equal to g(a,b,c) when p = oo, in 
which case the two loci are independent. Theorem 2.3 of Ethier and Griffiths 
(1990) states that qo(0, 0, c) = q A (cA)q B (cb)- More generally, one can obtain 
the following result for the leading-order contribution. 

Proposition 3.1. In the asymptotic expansion (3.1) of the two-locus 
sampling formula, the zeroth-order term f/o(a, b,c) is given by 

(3.4) q (a, b, c) = q A (a + c A )q B (h + c B ). 

Although this result is intuitively obvious, in Section 5.1 we provide a 
detailed new proof, since it well illustrates our general strategy. One of the 
main results of this paper is a closed-form formula for the next order term 
qi(a, b,c). The case with c = admits a particularly simple solution. 

Lemma 3.1. In the asymptotic expansion (3.1) of the two-locus sampling 
formula, the first-order term satisfies 

9 i(a,b,0) = 

for arbitrary a and b. 

That (/i (a, b, 0) vanishes is not expected a priori. Below we shall see that 
q2(a, b, 0) ^ in general. For an arbitrary configuration matrix c of nonneg- 
ative integers, we obtain the following closed-form formula for gi(a,b,c). 



Theorem 3.1. In the asymptotic expansion (3.1) of the two-locus sam- 
pling formula, the first- order term qi(a, b,c) is given by 

qi (a, b,c) = ( ^ J q A (a + c A )q B (b + c B ) 



(3.5) 



g > + c B )fV C |V(a + <^-e i ) 
/(a + c^W^'V^ + CB-e,) 

.7=1 V ' 



+ E E [°2 ) qA{a +CA ~ e * )qB{h +CB ~ e ^ 



t=l j=l 

for arbitrary configurations a, b, c of nonnegative integers. 
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Lemma 3.1 is used in proving Theorem 3.1. A proof of Theorem 3.1 is 
provided in Section 5.2, while a proof of Lemma 3.1 is given in Section 5.3. 
Note that the functional form of go(a,b, c) and Qi(a, b,c) in (3.4) and (3.5) 
has no explicit dependence on mutation; that is, the dependence on mu- 
tation is completely absorbed into the marginal one-locus probabilities. It 
turns out that (3.4) and (3.5) are universal in that they also apply to an 
arbitrary finite-alleles model of mutation, with q A and q B replaced with ap- 
propriate marginal one-locus probabilities for the assumed mutation model. 
See Jenkins and Song (2009) for details. 

In principle, similar arguments can be used to find the (j + l)th-order 
term given the jth, although a general expression does not seem to be easy 
to obtain. In Section 5.4, we provide a proof of the following result for 
<72(a,b,c). 



Theorem 3.2. In the asymptotic expansion (3.1) of the two-locus sam- 
pling formula, the second-order term q2(a, b,c) is of the form 



(3.6) 



g 2 (a, b, c) = q 2 (a + c A , b + c B , 0) + cr(a, b, c) 



where cr(a,b,c) is given by the analytic formula shown in the Appendix, and 
q 2 (&, b,0) satisfies the following strict recursion: 

[a(a + A - 1) + b(b + 6 B - l)]g 2 (a, b, 0) 

K L 

= ^2 ai (ai - l)g 2 (a-ei,b,0) + ^b j (b j - l)<y 2 (a, b - ej, 0) 

i=i j=i 

K L 

(3.7) + 6 A S ai:1 q 2 (a - e h b, 0) + 9 B ^ 6 bjA q 2 (sL, b - ej, 0) 

i=i j=i 

K 



+ 4 



1 A + a 



i=l 



IB 



+ &-!)£' 



with boundary conditions q 2 (^i, 0, 0) = f/ 2 (0, ej, 

je[4 



q A (*)q B (b) 

for all i G [K] and 



In contrast to f/i(a, b, 0) (cf. Lemma 3.1), it turns out that q 2 (a., b, 0) does 
not vanish in general. We do not have an analytic solution for ^(a, b, 0), but 
note that (3.7) is strictly recursive and that it can be easily solved numer- 
ically using dynamic programming. Numerical study (not shown) suggests 
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that the relative contribution of + C A, b + eg, 0) to q(a, b, c) is in most 
cases extremely small. [See Jenkins and Song (2009) for details.] Deriving 
an analytic expression for cr(a, b, c) in (3.6) is a laborious task, as the long 
equation in the Appendix suggests. We have written a computer program 
to verify numerically that our analytic result is correct. 

4. Joint distribution of the number of alleles at the two loci in a sample. 

Following the same strategy as in the previous section, we can obtain the 
asymptotic behavior of the joint distribution of the number of alleles ob- 
served at the two loci in a sample. To make explicit the dependence of these 
numbers on the sample size, write the number of alleles at locus A as K a b iC 
and the number of alleles at locus B as L a ,6,c- Ethier and Griffiths (1990) 
proved that the probability p(a,b,c\k,l) := F(K a ^ jC = k,L a ^ c = I) satisfies 
the recursion 

[n(n - 1) + 9a(ci + c) + 8 B (b + c) + pc]p(a, b, c; k, I) 

= a(a — 1 + 2c)p(a — 1, b, c; k, Z) + b(b — 1 + 2c)p(a, b — 1, c; k, I) 

+ c(c — l)p(a, b, c — 1; k, Z) + 2abp(a — 1, b — 1, c + 1; k, I) 

(4.1) + 9 A [ap{a — l,b,c;k— 1, 1) + cp(a, b+l,c — l;k — 1,1)] 

+ 0B[bp(a, b — 1, c; k, I — 1) + cp(a + 1, 6, c — 1; k, I — 1)] 

+ pcp(a + 1,6+ l,c — 1;M), 

where p(a, 6, c; fc, Z) = if a < 0, 6 < 0, c < 0, < 0, I < 0, a = b = c = 0, or k = 
I = 0. Equation (4.1) has a unique solution satisfying the initial conditions 

p(l, 0, 0; fc, = £fc,i£j,o, p(0, 1, 0; fc, Z) = <5 fc , 4i 

for A;, Z = 0, 1, . . . , n. 

As with Golding's recursion, equation (4.1) can be solved numerically, 
but quickly becomes computationally intractable with growing n. The only 
exception is the special case of p = oo, for which the distribution is given 
by the product of (2.4) for each locus. In what follows, we use the following 
notation in writing an asymptotic series for p(a,b,c;k,l). 

Definition 4.1. For loci A and B, respectively, we define the analogues 
of (2.4) as 



(4.2) p A {a-k) 



s(a, k)0\ 



(0 A )a 

and 

(4.3) P >;0 



(Ms 

where s(a,k) and s(b,l) are the Stirling numbers of the first kind. 
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We pose the expansion 

(4.4) p(a, b, c; k, I) = Po (a, b, c; k, I) + P^ c ' k ^ + 

P 

for large p. Then, in Section 5.5, we prove the following result for the zeroth- 
order term. 

Proposition 4.1. For an asymptotic expansion of the form (4-4) sat- 
isfying the recursion (4-1), Po(a,b,c;k,l) is given by 

(4.5) p (a,b,c;k,l) = p A (a + c;k)p B (b + c;l). 

Similar to Lemma 3.1, we obtain the following vanishing result for the 
first-order term in the case of c = 0. 

Lemma 4.1. For an asymptotic expansion of the form (4-4) satisfying 
the recursion (4-1), we have 

p 1 (a,b,0;k,l) = 0. 

Using this lemma, it is then possible to obtain the following result for an 
arbitrary c. 

Proposition 4.2. For an asymptotic expansion of the form (4-4) sat- 
isfying the recursion (4-1), Pi(a,b,c;kJ) is given by 

pi (a, b, c; k, I) = C( ' C ~ 1 - ) [p A (a + c; k) - p A (a + c - 1; k)] 

(4.6) 

x [p B (b + c-l)-p B {b + c-l;l)\. 

Proofs of Proposition 4.2 and Lemma 4.1 are provided in Sections 5.6 and 
5.7, respectively. 

5. Proofs of main results. In what follows, we provide proofs of the re- 
sults mentioned in the previous two sections. 

5.1. Proof of Proposition 3.1. First, assume c > 0. Substitute the expan- 
sion (3.1) into Golding's recursion (2.5), divide by pc and let p — s> oo. We 
are then left with 

K L 

(5.1) q Q (a.,b,c) = ^2^2 - 1 q (a + e h b + ej,c - eij). 

i=ii=i c 
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Now, applying (5.1) repeatedly gives 

i u \ n(ij)e[i<rix[L] °iy , , n , 

Q (a,b,c)= 2^ c , q (a + c A ,b + c B ,0), 

ordcrings 

where the summation is over all distinct orderings of the c gametes with mul- 
tiplicity c = (cjj ) . There are ^ c! - , such orderings and since the summand 

is independent of the ordering, we conclude 

(5.2) g (a,b,c) = q (a. + c A ,h + c B ,0). 

Clearly, (5.2) also holds for c = 0. From a coalescent perspective, this equa- 
tion tells us that any gamete with specified alleles (i.e., "carrying ancestral 
material") at both loci must undergo recombination instantaneously back- 
ward in time. 

Now, by substituting the asymptotic expansion (3.1) with c = into Gold- 
ing's recursion (2.5) and letting p — > oo, we obtain 

[n(n - 1) + 6 A a + 6> B % (a, b, 0) 

K L 

= s ^a i {a i - l)g (a- ej,b,0) + S ^b j (b j - l)g (a,b- ej,0) 

i=i j=i 

(5.3) 

K L 

+ 2 aibjq (a - e h h - ej, e»j) 

i=i j=i 

K L 

+ 0a^2 <W9o(a - e { , b, 0) + 6 B ^ d^^q^a, b - e j: 0). 
i=i j=i 

Equation (5.2) implies go(a — e,/,b — ej,ejj) = go( a , b,0), so with a bit of 
rearranging we are left with 

[a(o + A - 1) + &(& + ^ " 1 )]«o(a, b, 0) 

K L 

(5.4) =^ ai ( 0i - l)q (&- ei,h,0) + ^2bj(bj - l)g (a, b - ej, 0) 

i=i i=i 

+ #A ^ ^ ai ,i9o(a - b, 0) + 6>b ^ <5 6ji ig (a, b - ej, 0) 
i=i j=i 

with boundary conditions qo(ei,0,0) = qo(0,ej,0) = 1 for all i € [K] and 
j € [L]. Noting that (5.4) is the sum of two independent recursions of the 
form (2.3), one for each locus and each with appropriate boundary condition, 
we conclude that go( a , b,0) is given by 

(5.5) <7o(a,b,0) = /(a)g s (b), 
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a product of two (ordered) ESFs. It is straightforward to verify that (5.5) 
satisfies (5.4). Finally, using (5.2) and (5.5), we arrive at (3.4). 

5.2. Proof of Theorem 3.1. First, assume c > 0. Substitute the asymp- 
totic expansion (3.1) into Golding's recursion (2.5), eliminate terms of order 
p by applying (5.1), and let p — > oo. After applying (5.2) to the remaining 
terms and invoking (5.4), with some rearrangement we obtain 

K L 

cqi (a, b, c) - } } Cj 3 qx (a + e^, b + e j} c - e^-) 
i=i j=i 

= c(c- l)g (a + CA,b + CB,0) 

K 

-y)cj-(ci- - l)go(a + c j4 -ei,b + c B ,0) 

i=l 
L 

~y^, c j( c -j - l)<7o(a + CA,b + CB -ej,0) 

K L 

+ / J / J Cjj(cjj ~ l)go(a + CA - ei,b + c B -ej,0). 

i=l j=l 

Now, by utilizing (5.5), this can be written in the form 

K L 

(5.6) gi(a,b,c) = /(a,b,c) + ^^^-gi(a + e i ,b + e i ,c - e^), 

i=i j=i c 

where 

/(a, b, c) := (c - l)q A (a + c A )q B (h + c B ) 



(5.7) 



Bru , n c i-( c i- — 1) At , \ 

- ^(b + c B ) > ^(a + c A - e») 

8=1 

- ^(a + c A ) £ c -i^ - 1 ) g B (b + Cfi _ e . } 

i=i C 

+ E Qj(Qj " 1) g A (a + Cj4 - eO^Cb + c B - e,; 



i=i j=i 

Above, we assumed c > 0. We define /(a, b,c) = if c = 0. Iterating the 
recursion (5.6), we may write (?i(a, b,c) as 

<?i(a,b,c) = /(a,b,c) 
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K L 



i=i j=i 



f(a + ei,b + ej,c-eij) 



K L j , 

Ci'j' — <Jii'Ojj> 



c- 1 

i'=i j'=i 



x qi (a + e,j + , b + a,- + ey 



Similarly, repeatedly iterating (5.6) yields 

gi(a,b,c) =gi(a + CA,b + CB,0) + /(a,b,c) 

+ ^ ^-/(a + , b + , c - ) 

hji 

(5 8) _)_ c nii c »2j2 ~ ^iji,»2j2 

iljl 1*2.72 

* /(a "I - ~l~ ^t2 1 ^ ~l~ ®ji "I - ®ja i ^ ®tiii ^iiji ) 

+ •••+ E ^pW + c A ,b + c B ,0). 

hji,—,icjc 

The key observation is that the right-hand side of (5.8) has a nice proba- 
bilistic interpretation which allows us to obtain a closed-form formula. To 
be more precise, consider the first summation 

/ (a + e h , b + e h , c - e hjl ) . 

hjl 

For a fixed sample configuration c, this can be interpreted as the sum over all 
possible ways of throwing away a gamete at random and calculating / based 
on the remaining subsample, which we will denote c( c_1 ). Equivalently, it is 
the expected value of / with respect to subsampling without replacement 
c — 1 of the gametes in c. Write this as 

E [_f( A (c-l) )B (c-l) )C (c-l))] 5 

where C^ -1 -* is the random subsample obtained by sampling without re- 
placement c — 1 gametes from c, and A( c_1 ) := a + ca — 1 \ B( c_1 ) := 
b + CB-C^ _1) . Note that once the subsample c^ c ^ is obtained, then a^ c ^ 
and b( c_1 ) are fully specified. More generally, consider the (c — m)th sum 
in (5.8). A particular term in the summation corresponds to an ordering 
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of c — m gametes in c, which, when removed leave a subsample c( m \ With 
respect to this subsample, the summand is 

K L . . 

C,-! m! 



1=1,7=1 c ij 



and for each such subsample there are ( ° - (m)) distinct orderings of the 
remaining types in c, with each ordering contributing the same amount to 
the sum. Here, ( denotes the multinomial coefficient: 

c-m \ (c — m)\ 



Gathering identical terms, the (c — m)th sum in (5.8) can therefore be writ- 
ten over all distinct subsamples of c of size m: 

E( t 7-))nn^7^"' bl *' cW ' 

c («0 V 7 i=l j=l c ij ■ 

EAnn( c M)/( a(m) ' b(m) ' c(m) ; 

-t™\ \m) i=\ j=\ \ H ' 



a (m) \m) j=l j=l v tj 

E[/(A (m) ,B (m) ,C (m) 
random variable; that is 



where, for a fixed m, = (Cj;- ) is a multivariate hypergeometric(c, c, m) 



n prMr'])^ n u<™> 

\ij)e[Jf]x[L] 7 W (ij)e[X]x[£] v « 

Furthermore, marginally we have 

Cjj ~ hypergeometric(c, , m) , 
~ hypergeometric(c, Cj. , m) , 

C.j- ~ hypergeometric(c, c.j,m). 
In summary, (5.8) can be written as 

c 

(5.9) g 1 (a,b,c) = g 1 (a + c A ,b + c B ,0)+^E[/(A( m ),B( m ),C( m ))]. 

m=l 

According to Lemma 3.1, the first term q±(a + c^,b + cb ,0) vanishes, so we 
are left with 

c 

(5.10) gi(a,b,c) = ^E[/(A( m ),B( m ),C( m ))]. 

m=l 
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Finally, since + C^ m) = a + c A and B( m ) + = b + c B , (5.7) and 
(5.10) together imply 



?i(a,b,c) 

c 

= £ 

m=l 



(m-l)q A (a + c A )q B (b + c B ) 

- q B (b + c fl )^ ^Efcf^cf > - l)]/(a + c A - e 4 ) 

i=l 

- /(a + c A )-jy{C%\d<p - l)]q B (h + c B - e,) 

171 3=1 



m ■ 

i=i j=i 

The moments in this equation are easy to compute and one can sum them 
over m to obtain the desired result (3.5). 

5.3. Proof of Lemma 3.1. First, note that for any sample (a, b,c) and 
any subsample of the form (a^b^c^ 1 )), we have f{sS l \h^ l \c^) = 0, 
since every term on right-hand side of (5.7) has a vanishing coefficient. So, 
equation (5.9) implies 

(5.11) g 1 (a-e i ,b-ej,e ij -) =gi(a,b,0) 

for any G [K] x [L]. Now, substitute the asymptotic expansion (3.1) 
with c = into Golding's recursion (2.5). Note that terms of order p are 
absent since c = 0. Eliminate terms with coefficients independent of p by 
applying (5.3), multiply both sides of the recursion by p, and let p — > oo to 
obtain the following: 

[n(n- 1) + 6 A a + 9 B b]q 1 (8L,b,0) 

K L 

= ^2ai(ai - l)gi(a-e i) b,0) +^b j (b j - l)gi(a,b - a,,0) 

i=i j=i 

K L 

+ 2 ajbjqi (a - e,, b - e^e^) 
i=i j=i 

K L 

+ 9 A ^2 K ,1 qi (a - e, , b, 0) + 6 B ^ S bjjl qi (a, b - e 3 ; , 0) 

i=i i=i 
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with boundary conditions qi(ei, 0,0) = (/i(0,e,,0) = for all i £ [K] and 
j € [L]. This equation can be made strictly recursive by applying (5.11) to 
gi(a — ej,b — e,-,e,j). It therefore follows from the boundary conditions (for 
example, by induction) that gi(a, b,0) = 0. 

5.4. Proof of Theorem 3.2. Here, we provide only an outline of a proof; 
details are similar to the proof of Theorem 3.1. Substitute the asymptotic 
expansion (3.1) into Golding's recursion (2.5), eliminate terms with coeffi- 
cients proportional to p or independent of p. Then, multiply both sides of 
the recursion by p and let p — > oo to obtain 

K L 

cq 2 (a, b,c) ~ y^y~] Cijq 2 (& + e i; b + e,-, c - e^) 
i=\ j=i 

K 



(5.12) 



^Ja^a, - 1 + 2cj.)gi(a- e,j,b,c) 

i=X 

L 

+ ^2 bj (bj - 1 + 2c.j)qi (a, b - e, , c) 

K L 

+ ^2^2[cij{cij -l)?i(a,b,c-eij) 

i=l j=l 



K 



i=l 



+ 2a i bjq 1 (a-ei,b-e j ,c + e ij )] 
■ L 

E^a i +c i .,i'5c iJ ,iQ , i(a,b + e,-,c - e^) 



A" 

.t=i 



+ £a j) i <5 c i .,o<?i(a-e i ,b,c) 



,-,i^cij,iQi(a + ei,b,c- ejj) 



•6& 3 -,i<5c.j,o?i(a>b-ej,c) 



- [n(n - I) + 9 A (a + c) + 9 B (b + c)]qi(a,b,c). 

By substituting our expression (3.5) for q\(a, b,c), the right-hand side can 
be expressed as a function g(a, b,c) which is completely known but rather 
cumbersome to write down. As in the proof of Theorem 3.1, the same "un- 
wrapping" maneuver can be applied to rearrange (5.12) into the form (3.6), 
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where 

c 

<r(a,b,c) = ^E[ 5 (A( m ),B( m \c( m ))]. 

m=l 

This time E[g{A^ m \B^ m \ C^)} is a function of fourth-order moments of 
the multivariate hypergeometric distribution. The formula shown in the 
Appendix is obtained by evaluating the expectations and summing over 
m. 

We now show that ^(a, b,0) satisfies the recursion shown in (3.7). We 
will use the fact that for a sample (a — e^, b — e^ey), we have 

g(a-ei,b-e j ,e ij ) 

= 2(a-l)(b-l)q A ( a )q B (b) 

(5.13) - 2(6 - l)(oi - l)(? A (a - ei)q B (b) 

-2(a-l)(b J -l)q A ( a )q B (b-e J ) 

+ 2(o, - lX&j- - l)q A (a - e 4 )g B (b - e,). 

For an arbitrary c, g(a, b,c) is much more complicated. 

Now, one can adopt the approach used in the proof of Lemma 3.1 to 
obtain a strict recursion for ^(a + ca, b + eg, 0). First, note that (3.6) and 

(5.13) imply 

q 2 (&- ei,b - ej,eij) 

= (?2 (a,b,0)+Eb((A-e t ) (1) ,(B-e,) (1) ,eW)] 
= q 2 (a,b,0) + g(a - ei,b - ej ,eij) 

(5.14) = q 2 (a,b, 0) + 2(a - 1)(6 - l) g A (a)g B (b) 

-2(6-l)(a i -l) (? A (a-e i )g B (b) 
_2(a-l)(6,-l) (? A (a)g B (b-e,) 
+ 2(a 4 - 1)(6, - l)/(a - ei)q B (b - e,). 

As before, substitute the asymptotic expansion (3.1) for c = into Gold- 
ing's recursion (2.5), eliminate terms with coefficients independent of p or 
proportional to p" 1 , and let p^oo to obtain 

[n(n - 1) + 9 acl + B % 2 (a, b, 0) 

K L 

= ^ai(oi - l)q 2 (a- ei,b,0) + ^2bj(bj - l)g 2 (a,b-e i ,0) 

i=i j=i 
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K L 

+ 2 Qi6jg2 (a - ej , b - ej , ) 

i=l j=l 

+ 0^ ^ ^,192 (a - e» , b, 0) + 9 B ^ 5 &i 492 (a, b - ej , 0) 

i=l j=l 

with boundary conditions (72(^,0,0) = (72(0, ej-,0) = for all i 6 [if] and 
j £ [L]. This equation can be made strictly recursive by applying (5.14) to 
(72 (a — ej,b — e,-,ej,-). After some simplification, this leads to the recursion 
(3.7). 

5.5. Proof of Proposition 4-1- The proof is similar to the proof of Propo- 
sition 3.1, working with the system (4.1) rather than Golding's recursion 
(2.5). First, assume c > 0. Substitute the expansion (4.4) into the recursion 
(4.1), divide by pc and let p — > oo. We are left with 

Po(a,b,c;k,l) = Po(a + 1, 6 + 1, c - 1; k, I), 

which implies 

(5.15) Po(a, b, c; k, I) =p (a + c,b + c, 0; k, I). 

Clearly, (5.15) also holds for c = 0. 

Now, by substituting the asymptotic expansion (4.4) with c = into (4.1) 
and letting p — > oo, we obtain 

[n(n - 1) + 6 ao + 0Bb]po(a, b, 0; k, I) 

= a(a — l)po(a — 1, b, 0; k, I) + 6(6 — l)po(a, b— l,c;k,l) 

(5.16) + 2abp (a-l,b-l,l;k,l) 

+ 9 A ap (a — l,b,0;k- 1, i) 

+ 0B6po(a,&-l,O;A:,Z- !)■ 

After invoking (5.15) on po(a — 1,6 — 1, 1; fc,Z) and rearranging, we are left 
with 

[a(a + e A -l) + b(b + 9 B - l)]p (a, b, 0; k, I) 

= a(a - l)p (a - 1, b, 0; fc, i) + b(b - l)p Q (a, 6-1,0; fc, I) 

(5.17) 

+ 0Aapo(a- 1,6,0; A; - 1,1) 
+ 9 B bp (a,b- 1,0; k,l-l) 
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with boundary conditions po(l> 0,0; k, I) = 6k,i5i t o andpo(0, 1, 0; k, I) = <5fc,o<^,i- 
Equation (5.17) can be expressed as a linear sum of two independent recur- 
sions: 

(a - 1 + 9 A )p A (a; k) = (a - l)p A (a - 1; fc) + #aPo ( a - 1; & - 1), 
(6 - 1 + 0bK(&; = (6 - l)Po(b ~ 1; + 0flPo (6 - 1; Z - 1) 

with respective boundary conditions p A (l; k) = 8k,i and p B (l; i) = fyi. These 
recursions are precisely those considered by Ewens [(1972), (21)], with re- 
spective solutions (4.2) and (4.3). Hence, p A (a;k) = p A (a;k) and p^(6;Z) = 
p B (b;l), and it is straightforward to verify that p A (a;k)p B (b;l) satisfies 
(5.17). Substituting this solution into (5.15), we arrive at (4.5), as required. 

5.6. Proof of Proposition 4-2. First, assume c > 0. Substitute the asymp- 
totic expansion (4.4) into the recursion (4.1), eliminate terms with coeffi- 
cients linear in p by applying (5.15), and let p— > oo. After applying (5.15) 
to the remaining terms and invoking (5.17), with some rearrangement we 
obtain 

Pi(a,b, c; k, I) — pi(a + 1,6 + l,c— l;k,l) 

= (c- l)[p (a + c,b + c,0;k,l) -p (a + c- 1, b + c, 0; k, I) 

(5.18) 

- p (a + c, b + c - 1, 0; k, I) 

+ Po(o + c-l,6 + c-l,0;Ar,Z)]. 

Applying the recursion repeatedly, this becomes 

pi(a,b,c;k,l) 

= pi(a + c, b + c, 0; k, I) 

+ [p {a + c,b + c,0; k, I) - p {a + c — 1, b + c, 0; A;, I) 

(5.19) 

- p (a + c, 6 + c — 1, 0; k, I) 

+ p (a + c-l,b + c-l,0;k,l)] 

c-l 

x (c — 1 — m). 

m=0 

According to Lemma 4.1, the first term pi(a + c, 6 + c, 0; k, I) vanishes. 
Hence, since po(a,b,c;k,l) is given by (4.5), the right-hand side of (5.19) is 
fully known. With some rearrangement, we are left with (4.6). 
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5.7. Proof of Lemma J^.l. First, note that (5.18) implies 

(5.20) pi(o-l,6-l,l;jfe,i) = p 1 (a,b,0;k,l). 

Now, substitute the asymptotic expansion (4.4) with c = into (4.1), elim- 
inate leading-order terms by applying (5.16), and let p — > oo. The result is 
made strictly recursive by invoking (5.20), and we obtain 

[a(a + A - 1) + 6(6 + 9 B - l)bi (a, b, 0; k, I) 

= a(a — l)p\{a — 1, 6, 0; k, I) 

+ b(b-l)p 1 (a,b-l,0;k,l) 

+ 6Aapi(a — 1, b, 0; k — 1, /) 

+ 9 B bpi(a,b- 1,0; A;,/ - 1) 

with boundary conditions pi(l, 0, 0; fc, I) = pi(0, 1,0; fc, I) =0, for k,l = 0, 
. . . ,n. It therefore follows (e.g., by induction) that pi(a, b, 0; k, I) = 0. 

APPENDIX: EXPRESSION FOR <r(A,B,C) 

We use Q A to denote q A (a + ca), Q A to denote q A (a + ca — e^), Q^. to 
denote g A (a + — — e^), and so on. Then 



(r(a,b,c) 



(c-l)(c + l)(3c-2) 



+ (c - l)(3a + 36 + 2c - 1) + 6a6 



Q A Q B 



9a(c-1) 



^25 ai ,odci.,iQtQ B - ° B ° 9 1 ^fy j ,o<Sc. J -,iQ A Qf 



A" 



+E 

t=i 



t=i 

c(c - 3) + 2a + 46 - 4 



ft" r 



i=l 



(26 + 0-1)^.(^ + ^.-1) 
5 — 6a,- — 4c,-. 



6 



Ci.(Cj. - 1) 

QtQ B 

Ci .{ Ci .-l)Q A Q B 



j'=i 



6I B -c(c- 3) + 26 + 4a -4 



C.j (c.j - 1) 



(2a + c- l)c.j(6j + c.j - 1) 



Q A Qf 
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+ § Vifcv 

i,k=l 



B 5-66,-40., 

T <5c,,2 + g 



^■(cj-lJQ^Qg 



if L 



i=l J = l 



if L 



i=i i=i 



Cj.(cj. - l)& 3 -(a_j - 1) 



+ 



&ij\(Hj H - 1 2c^. H - 2cj.c.j 2c. 



+ Cijbj(a. - 1) + Cija,i(c.j - 1) + 2aibjCij 
Ob 

+ — 4 .o'X- i(Vv. - 1) 



+ y<WV...:(V,,. i •:<•., - 1) 



+ 5 £ Oj + Cj. - 1 - y(5 Ci . j2 X] c 4j (qj - l)QfiQ 

i=l - 

i L r 

+ ^Y b 3 + c -i _ 1 - Y Scj ' 2 ^l^ipij-^QiQ. 
j=i L J i=i 

. if L if 

i=l j=l fc=l 

if L L 

i=i j=i ;=i 

^ if L if £ 

+ 8 Y Y Y Y Ci i ~ 1 ) c ^ ~ 1 ) < 3il ( 3 
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'Of 



i=l j=l k=l 1=1 
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- 12 EE C ^ " ^ ~ l )QtiQfr 

i=l j=l 

To check the correctness of the above expression, we also solved the recur- 
sion (5.12) numerically for all sample configurations of sizes n = 10,20, and 
30 (with K, L < 2), and confirmed that the above analytic expression agreed 
in all cases. We also implemented a Mathematica program to solve q(a, b, c) 
exactly in the special case K = L = 1. The program can return series expan- 
sions in terms of p^ 1 as p — > oo which are symbolic in 9a and 6b- We could 
then compare the first three terms against qo, qi, and q2 , for various sample 
configurations (a,b,c). 
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